\section{Introduction}
We humans rely heavily on visual perception. We see patterns, shapes and
textures, and we can recognize objects and entities. Among ourselves, we can then use
language to communicate about the world. If we want to create artificial agents
capable of performing tasks that involve visual objects or data, but are expressed by us in
natural language, the agent needs a way to associate language with visual input.
The agent needs to be able to use, and possibly create, a \emph{word-image lexicon}.

Say we tell our house robot, which recognizes objects using vision, to
fetch our slippers. In order to recognize the slippers, the
robot must then be able to translate the word \emph{slippers} into the visual
characteristics of our slippers. The same translation mechanism could also be
used by pure software agents, such as search engines searching for images
corresponding to a search phrase.

\subsection{Learning a lexicon}
Children have a natural ability to connect real-world objects
with words. Not only do they associate visual characteristics with
certain objects, but they can also recognize objects by their function and
use. For a child, this process takes many years. We are interested
in simpler systems that can use the vast amounts of
text-image material available in books, newspapers and on the web. We
also limit ourselves to using only visual characteristics as a means of
recognizing objects.

\subsection{Our approach}
% Mainly from project proposal:
We have designed a system with four main components:
\begin{itemize}
\item \textbf{Object-detection} A component that takes an image as
  input and outputs regions of the image, \emph{blobs}, containing the
  individual objects in the image.

\item \textbf{Image-recognition} A component that, given two blobs, compares them
  and estimates how alike they are. This is then used to determine whether
  the blobs contain the same object.

\item \textbf{Natural Language Processing} A component that, given the text on a
  page, extracts the words that represent objects that might exist in the image.

\item \textbf{Reasoning system/Machine Learning} A component that uses all the
  other components. Given the blobs and the words representing objects, it
  compares material from all the pages in the book and deduces which blobs
  are similar and can, with reasonable confidence, be labelled with a
  word representing the object in the blob.
\end{itemize}

\subsection{A simple example}
Throughout our project we have chosen to work with data collected from
picture books for young children. They have been chosen for characteristics such
as simple texts, simple images (where objects are easy to distinguish from
each other) and images representing much of the text content.

Reading a book should result in a small lexicon of words, each paired with some
images of the object that the word represents, for example:

\begin{figure}[H]
    \begin{center}
    \includegraphics[width=0.50\textwidth]{CalvinHobbes.jpg}
    \caption{Page 1: This is Calvin and Hobbes.}
    \end{center}
\end{figure}

Analyzing this, we get two objects (or possibly some more) and two
words, \emph{Calvin} and \emph{Hobbes}.

\begin{figure}[H]
    \begin{center}
    \includegraphics[width=0.50\textwidth]{calvin.jpg}
    \caption{Page 2: Calvin has a ball.}
    \end{center}
\end{figure}

Now, given pages 1 and 2, we could deduce which object is Calvin, which
object is the ball and which object is Hobbes, and build a lexicon consisting of
these words and images.
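
Under the idealized assumption that object detection finds exactly the
mentioned objects and that blobs are matched perfectly across pages, this
deduction can be sketched as a few set operations. The symbols $B_i$ (blobs
on page $i$), $W_i$ (object words on page $i$) and the blob names
$b_{\mathrm{A}}$, $b_{\mathrm{B}}$, $b_{\mathrm{C}}$ are introduced here for
illustration only:
\[
\begin{array}{lll}
B_1 = \{b_{\mathrm{A}}, b_{\mathrm{B}}\}, &
W_1 = \{\mathit{Calvin}, \mathit{Hobbes}\}, & \\
B_2 = \{b_{\mathrm{A}}, b_{\mathrm{C}}\}, &
W_2 = \{\mathit{Calvin}, \mathit{ball}\}, & \\
B_1 \cap B_2 = \{b_{\mathrm{A}}\}, &
W_1 \cap W_2 = \{\mathit{Calvin}\} &
\Rightarrow\ \mathit{Calvin} \mapsto b_{\mathrm{A}}, \\
B_1 \setminus B_2 = \{b_{\mathrm{B}}\}, &
W_1 \setminus W_2 = \{\mathit{Hobbes}\} &
\Rightarrow\ \mathit{Hobbes} \mapsto b_{\mathrm{B}}, \\
B_2 \setminus B_1 = \{b_{\mathrm{C}}\}, &
W_2 \setminus W_1 = \{\mathit{ball}\} &
\Rightarrow\ \mathit{ball} \mapsto b_{\mathrm{C}}.
\end{array}
\]
Each word can be assigned only because both the blob set and the word set
reduce to a single element. If a page contains unmentioned objects, or if
several words remain as candidates, the resulting sets are no longer
singletons and the assignment stays ambiguous.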

The text in real books is nowhere near as simple as the example above. Many objects that
can be found in an image will typically not be mentioned in the text. Cartoon
images are sometimes difficult to compare and recognize. Moreover, many different
words, such as \emph{Calvin, boy, son, he}, might be used in the text when
referring to a single object.

\subsection{The report}
In this report, we present our system for automatic generation of
word-image lexicons. The implementation is described in the
\emph{Solution} section. We mainly use known solutions, some of which we
knew about before designing the system and some of which we later
discovered to be widely used in the field.

We have chosen to work only with small datasets: picture books for
small children. As can be seen in the \emph{Results} section, this has
turned out to be a major limitation for the end result. However, parts
of the system perform fairly well, even on this small dataset. This
suggests that getting good results with picture books is not
impossible in principle.


